Dynamic Join Product Skew Handling for Hash-Joins in Shared-Nothing Database Systems
Abstract
When data are uniformly distributed, parallel hash-based join algorithms scale up well. However, data skew can cause load imbalance among the processors and significantly degrade performance. In this paper we propose a dynamic skew handling algorithm that addresses this load imbalance by detecting and handling join product skew at run time. The idea is to monitor join processing during the join phase and compare the average processing rate of each partition with the rate statically predicted in the scheduling phase. If the difference is large enough to cause significant performance degradation, the processor is considered overloaded and a workload compensation strategy is invoked dynamically. In this case, the amount of overload caused by the unpredicted join product skew is calculated from the measured average processing rate, and the amount of load to be migrated to the non-overloaded processors is determined. We propose two methods, result redistribution and processing task migration, to move load from the overloaded processor to the non-overloaded processors. Simulation results show that our dynamic skew handling approach detects and handles load imbalances efficiently, so that rebalancing the load among the processors yields an almost constant join execution time under different degrees of join product skew.
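The detection and compensation steps described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract gives no formulas, so the `Partition` fields, the `threshold` parameter, and the equal-finish-time rule used to size the migration are all assumptions made for illustration. The sketch also assumes that a migrated tuple is processed at the receiving processor's measured rate.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    processor: int          # processor id (hypothetical field names throughout)
    remaining_tuples: int   # tuples still to be joined on this processor
    predicted_rate: float   # tuples/sec predicted in the scheduling phase
    measured_rate: float    # average tuples/sec observed in the join phase


def is_overloaded(p: Partition, threshold: float = 0.5) -> bool:
    """Assumed rule: overloaded when the measured rate falls more than
    `threshold` (as a fraction) below the statically predicted rate."""
    return p.measured_rate < p.predicted_rate * (1.0 - threshold)


def plan_migration(partitions, threshold: float = 0.5):
    """Return {(source, destination): tuples_to_move}.

    Assumed sizing rule: pick the finish time T that all processors would
    share if the remaining work were spread in proportion to measured
    rates, then move each overloaded processor's surplus over rate * T
    to non-overloaded processors that have spare capacity before T.
    """
    total_rate = sum(p.measured_rate for p in partitions)
    if total_rate <= 0:
        return {}
    finish_time = sum(p.remaining_tuples for p in partitions) / total_rate

    # Spare capacity of each non-overloaded processor, in tuples.
    capacity = {
        p.processor: p.measured_rate * finish_time - p.remaining_tuples
        for p in partitions
        if not is_overloaded(p, threshold)
    }

    plan = {}
    for p in partitions:
        if not is_overloaded(p, threshold):
            continue
        surplus = p.remaining_tuples - p.measured_rate * finish_time
        for dst in capacity:
            if surplus <= 0:
                break
            move = min(surplus, capacity[dst])
            if move <= 0:
                continue
            capacity[dst] -= move
            surplus -= move
            if int(move) > 0:
                plan[(p.processor, dst)] = int(move)  # whole tuples only
    return plan


partitions = [
    Partition(0, 1000, 100.0, 20.0),   # far slower than predicted
    Partition(1, 200, 100.0, 100.0),
    Partition(2, 200, 100.0, 100.0),
]
plan = plan_migration(partitions)
```

In this example processor 0 runs at one fifth of its predicted rate, so most of its remaining tuples are planned for migration and split between processors 1 and 2. Whether the moved load takes the form of result redistribution or task migration is outside the scope of this sketch.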
Similar resources
Handling Data Skew in Multiprocessor Database Computers Using Partition Tuning
Shared-nothing multiprocessor architecture is known to be more scalable in supporting very large databases. Compared to other join strategies, a hash-based join algorithm is particularly efficient and easily parallelized for this computation model. However, this hardware structure is very sensitive to the data skew problem. Unless the parallel hash join algorithm includes some load balancing mec...
Practical Skew Handling in Parallel Joins
We present an approach to dealing with skew in parallel joins in database systems. Our approach is easily implementable within current parallel DBMS, and performs well on skewed data without degrading the performance of the system on non-skewed data. The main idea is to use multiple algorithms, each specialized for a different degree of skew, and to use a small sample of the relations being join...
Implementation and Analysis of Join Algorithms to handle skew for the Hadoop Map/Reduce Framework
The Map/Reduce framework, a parallel processing paradigm, is widely used for large-scale distributed data processing. Map/Reduce can perform typical relational database operations such as selection, aggregation, and projection. However, binary relational operators like join, cartesian product, and set operations are difficult to implement with Map/Reduce. Map/Reduce can process homogeneous ...
Efficient Outer Join Data Skew Handling in Parallel DBMS
Large enterprises have been relying on parallel database management systems (PDBMS) to process their ever-increasing data volume and complex queries. The scalability and performance of a PDBMS come from load balancing across all nodes in the system. Skewed processing will significantly slow down query response time and degrade the overall system performance. Business intelligence tools used by ent...
Tradeoffs in Processing Complex Join Queries via Hashing Multiprocessor Database Machines
In this paper we examine the problem of processing multi-way join queries (on the order of 10 joins) through hash-based join methods in a shared-nothing database machine. We first discuss how the choice of a format for a complex query can significantly affect performance in a multiprocessor database machine. Several query processing algorithms are then proposed and experimental results obtained...